Massachusetts General Hospital¶
Collaborative effort to investigate the plasma proteomic signatures of COVID-19 positive patients¶
Data provided by the MGH Emergency Department COVID-19 Cohort (Filbin, Goldberg, Hacohen) with Olink Proteomics. https://www.olink.com/mgh-covid-study/
Initial analysis¶
Plasma proteomics reveals tissue-specific cell death and mediators of cell-cell interactions in severe COVID-19 patients¶
https://www.biorxiv.org/content/10.1101/2020.11.02.365536v2
By analyzing several thousand plasma proteins in 306 COVID-19 patients and 78 symptomatic controls over serial timepoints using two complementary approaches, we uncover COVID-19 host immune and non-immune proteins not previously linked to this disease. Integration of plasma proteomics with nine published scRNAseq datasets shows that SARS-CoV-2 infection upregulates monocyte/macrophage, plasmablast, and T cell effector proteins. By comparing patients who died to severely ill patients who survived, we identify dynamic immunomodulatory and tissue-associated proteins associated with survival, providing insights into which host responses are beneficial and which are detrimental to survival.
Proposed Analyses¶
We reproduce the results obtained in Filbin et al 2020 using the Olink data (timepoint D0)
We compare COVID-19 positive patients to identify biomarkers of severity (timepoint D0)
Note: To define the different severity groups we make use of the WHO scores assigned to the patients.
[1]:
import os
from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz
from ckg.graphdb_connector import connector
from ckg.graphdb_builder import builder_utils
from plotly.offline import init_notebook_mode, iplot
%matplotlib inline
init_notebook_mode(connected=True)
c:\users\sande\.conda\envs\pip_rev\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package pingouin is out of date. Your version is 0.3.11, the latest is 0.3.12.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
**kwargs
WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.
Reading proteomics and clinical data¶
[2]:
#### Access: Data provided by the MGH Emergency Department COVID-19 Cohort
#### (Filbin, Goldberg, Hacohen) with Olink Proteomics. https://www.olink.com/mgh-covid-study/
olink_file = '../../assets/MGH_COVID_OLINK_NPX.txt'
olink_clin_file = '../../assets/MGH_COVID_Clinical_Info.txt'
Proteomics data¶
[3]:
olink_data = builder_utils.readDataFromCSV(olink_file, sep=';', comment='#')
[4]:
olink_data.head()
[4]:
SampleID | subject_id | Timepoint | Index | OlinkID | UniProt | Assay | MissingFreq | Panel | Panel_Version | PlateID | QC_Warning | LOD | NPX | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1_D0 | 1.0 | D0 | 56 | OID21311 | Q9BTE6 | AARSD1 | 0.04 | ONCOLOGY | 1 | 20200772_Plate5_NEURO_ONC | Pass | 0.7204 | 3.2277 |
1 | 1_D0 | 1.0 | D0 | 56 | OID20921 | Q96IU4 | ABHD14B | 0.06 | NEUROLOGY | 1 | 20200772_Plate5_NEURO_ONC | Pass | 0.5696 | 0.7205 |
2 | 1_D0 | 1.0 | D0 | 56 | OID21280 | P00519 | ABL1 | 0.04 | ONCOLOGY | 1 | 20200772_Plate5_NEURO_ONC | Pass | 0.5313 | 2.6293 |
3 | 1_D0 | 1.0 | D0 | 56 | OID21269 | P09110 | ACAA1 | 0.12 | ONCOLOGY | 1 | 20200772_Plate5_NEURO_ONC | Pass | 2.0588 | 3.2670 |
4 | 1_D0 | 1.0 | D0 | 56 | OID20159 | P16112 | ACAN | 0.04 | CARDIOMETABOLIC | 1 | 20200772_Plate5_CARDIO_INF | Pass | 1.1623 | 2.0308 |
QC and data exploration¶
[5]:
print("Number of proteins:", len(olink_data['UniProt'].unique()))
Number of proteins: 1420
[6]:
olink_data.shape
[6]:
(1148916, 14)
Remove proteins with QC warnings¶
[7]:
olink_data['QC_Warning'].unique()
[7]:
array(['Pass', 'Warning'], dtype=object)
[8]:
olink_data.groupby('QC_Warning').count()['SampleID']
[8]:
QC_Warning
Pass 1136634
Warning 12282
Name: SampleID, dtype: int64
[9]:
olink_data = olink_data[olink_data['QC_Warning'] == 'Pass']
[10]:
olink_data.shape
[10]:
(1136634, 14)
[11]:
print("Number of proteins that passed Olink QC:", len(olink_data['UniProt'].unique()))
Number of proteins that passed Olink QC: 1420
[12]:
print("Total number of patients:", len(olink_data['subject_id'].unique()))
Total number of patients: 384
[13]:
print("Total number of samples:", len(olink_data['SampleID'].unique()))
Total number of samples: 786
[14]:
olink_data.shape
[14]:
(1136634, 14)
Remove rows with missing subject id¶
[15]:
olink_data = olink_data.dropna(subset=['subject_id'])
[16]:
olink_data['subject_id'] = olink_data['subject_id'].astype('int').astype('str')
[17]:
olink_data.shape
[17]:
(1108054, 14)
Merge protein names and identifiers¶
[18]:
olink_data['identifier'] = olink_data['Assay'] +"~"+olink_data['UniProt']
[19]:
olink_data.head()
[19]:
SampleID | subject_id | Timepoint | Index | OlinkID | UniProt | Assay | MissingFreq | Panel | Panel_Version | PlateID | QC_Warning | LOD | NPX | identifier | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1_D0 | 1 | D0 | 56 | OID21311 | Q9BTE6 | AARSD1 | 0.04 | ONCOLOGY | 1 | 20200772_Plate5_NEURO_ONC | Pass | 0.7204 | 3.2277 | AARSD1~Q9BTE6 |
1 | 1_D0 | 1 | D0 | 56 | OID20921 | Q96IU4 | ABHD14B | 0.06 | NEUROLOGY | 1 | 20200772_Plate5_NEURO_ONC | Pass | 0.5696 | 0.7205 | ABHD14B~Q96IU4 |
2 | 1_D0 | 1 | D0 | 56 | OID21280 | P00519 | ABL1 | 0.04 | ONCOLOGY | 1 | 20200772_Plate5_NEURO_ONC | Pass | 0.5313 | 2.6293 | ABL1~P00519 |
3 | 1_D0 | 1 | D0 | 56 | OID21269 | P09110 | ACAA1 | 0.12 | ONCOLOGY | 1 | 20200772_Plate5_NEURO_ONC | Pass | 2.0588 | 3.2670 | ACAA1~P09110 |
4 | 1_D0 | 1 | D0 | 56 | OID20159 | P16112 | ACAN | 0.04 | CARDIOMETABOLIC | 1 | 20200772_Plate5_CARDIO_INF | Pass | 1.1623 | 2.0308 | ACAN~P16112 |
Clinical data¶
[20]:
olink_clin_data = builder_utils.readDataFromCSV(olink_clin_file, sep=';', comment='#')
[21]:
olink_clin_data.head()
[21]:
subject_id | COVID | Age cat | BMI cat | HEART | LUNG | KIDNEY | DIABETES | HTN | IMMUNO | ... | crp_3_cat | ddimer_3_cat | ldh_3_cat | abs_neut_7_cat | abs_lymph_7_cat | abs_mono_7_cat | creat_7_cat | crp_7_cat | ddimer_7_cat | ldh_7_cat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2 | 1 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 3 | 1 | 3 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 3.0 | 2.0 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 4 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2.0 | 2.0 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5 | 1 | 3 | 3 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 5.0 | 3.0 | NaN | 3.0 | 5.0 | 3.0 | 1.0 | 4.0 | 4.0 | 3.0 |
5 rows × 40 columns
[22]:
plot = viz.get_boxplot_grid(olink_clin_data, identifier="clin_vars", args={"title": "Clinical variables",
"x":"COVID",
"y":"Age cat",
"color":"COVID", "width":600})
iplot(plot.figure)
[23]:
olink_clin_data['subject_id'] = olink_clin_data['subject_id'].astype('str')
[24]:
olink_clin_data.columns.tolist()
[24]:
['subject_id',
'COVID',
'Age cat',
'BMI cat',
'HEART',
'LUNG',
'KIDNEY',
'DIABETES',
'HTN',
'IMMUNO',
'Resp_Symp',
'Fever_Sympt',
'GI_Symp',
'WHO 0',
'WHO 3',
'WHO 7',
'WHO 28',
'WHO max',
'abs_neut_0_cat',
'abs_lymph_0_cat',
'abs_mono_0_cat',
'creat_0_cat',
'crp_0_cat',
'ddimer_0_cat',
'ldh_0_cat',
'Trop_72h',
'abs_neut_3_cat',
'abs_lymph_3_cat',
'abs_mono_3_cat',
'creat_3_cat',
'crp_3_cat',
'ddimer_3_cat',
'ldh_3_cat',
'abs_neut_7_cat',
'abs_lymph_7_cat',
'abs_mono_7_cat',
'creat_7_cat',
'crp_7_cat',
'ddimer_7_cat',
'ldh_7_cat']
Merging both datasets¶
[25]:
data = olink_data.set_index('subject_id').join(olink_clin_data.set_index('subject_id')).reset_index()
[26]:
data.head()
[26]:
subject_id | SampleID | Timepoint | Index | OlinkID | UniProt | Assay | MissingFreq | Panel | Panel_Version | ... | crp_3_cat | ddimer_3_cat | ldh_3_cat | abs_neut_7_cat | abs_lymph_7_cat | abs_mono_7_cat | creat_7_cat | crp_7_cat | ddimer_7_cat | ldh_7_cat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1_D0 | D0 | 56 | OID21311 | Q9BTE6 | AARSD1 | 0.04 | ONCOLOGY | 1 | ... | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1 | 1_D0 | D0 | 56 | OID20921 | Q96IU4 | ABHD14B | 0.06 | NEUROLOGY | 1 | ... | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1 | 1_D0 | D0 | 56 | OID21280 | P00519 | ABL1 | 0.04 | ONCOLOGY | 1 | ... | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1 | 1_D0 | D0 | 56 | OID21269 | P09110 | ACAA1 | 0.12 | ONCOLOGY | 1 | ... | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1 | 1_D0 | D0 | 56 | OID20159 | P16112 | ACAN | 0.04 | CARDIOMETABOLIC | 1 | ... | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 54 columns
[27]:
data.shape
[27]:
(1108054, 54)
1. Covid-19 positive vs negative¶
[28]:
data = data[data['Timepoint'] == "D0"]
[29]:
data.shape
[29]:
(542015, 54)
[30]:
print("Total number of positive patients:", len(data['subject_id'].unique()))
Total number of positive patients: 383
[31]:
df = data[['subject_id', 'SampleID', 'identifier', 'WHO max', 'WHO 0', 'NPX', 'Age cat', 'COVID', 'BMI cat', 'Timepoint', 'HEART']]
[32]:
print("Total number of positive patients:", len(data['subject_id'].unique()))
Total number of positive patients: 383
[33]:
df_wide = analytics.transform_into_wide_format(df, index=['SampleID', 'subject_id'], columns=['identifier'], values='NPX', extra=['WHO max', 'WHO 0', 'Age cat', 'COVID', 'BMI cat','HEART'])
[34]:
df_wide.head()
[34]:
SampleID | subject_id | AARSD1~Q9BTE6 | ABHD14B~Q96IU4 | ABL1~P00519 | ACAA1~P09110 | ACAN~P16112 | ACE2~Q9BYF1 | ACOX1~Q15067 | ACP5~P13686 | ... | YES1~P07947 | YTHDF3~Q7Z739 | ZBTB16~Q05516 | ZBTB17~Q13105 | WHO max | WHO 0 | Age cat | COVID | BMI cat | HEART | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100_D0 | 100 | 3.2374 | 1.9080 | 3.4624 | 3.7201 | 2.8415 | 1.4208 | 0.3747 | 4.6433 | ... | 5.0236 | 0.4053 | 2.9979 | 1.2529 | 4 | 4 | 1 | 1 | 4 | 0 |
1 | 101_D0 | 101 | 2.1038 | 1.1206 | 2.0700 | 2.0108 | 2.6067 | 0.4767 | -0.0337 | 4.6615 | ... | 3.3785 | 0.2607 | 1.9315 | 1.1605 | 4 | 4 | 2 | 1 | 1 | 0 |
2 | 102_D0 | 102 | 2.7613 | 1.3493 | 2.8657 | 2.8731 | 2.0461 | 0.3986 | 0.2983 | 4.4803 | ... | 4.6267 | 0.4771 | 1.7620 | 0.6840 | 1 | 1 | 5 | 1 | 3 | 0 |
3 | 103_D0 | 103 | 2.6384 | 0.9447 | 1.4727 | 2.6550 | 1.6613 | 0.6749 | 0.1486 | 3.7081 | ... | 2.9520 | 0.4808 | 0.9480 | 0.4823 | 6 | 6 | 1 | 1 | 5 | 0 |
4 | 104_D0 | 104 | 5.3336 | 1.5130 | 2.0466 | 1.8627 | 2.4433 | 1.5736 | -0.1446 | 3.2100 | ... | 2.3919 | 0.0496 | 1.3069 | 1.1057 | 6 | 6 | 5 | 0 | 3 | 1 |
5 rows × 1428 columns
[35]:
df_wide.shape
[35]:
(383, 1428)
[36]:
df_wide.describe()
[36]:
AARSD1~Q9BTE6 | ABHD14B~Q96IU4 | ABL1~P00519 | ACAA1~P09110 | ACAN~P16112 | ACE2~Q9BYF1 | ACOX1~Q15067 | ACP5~P13686 | ACP6~Q9NPH0 | ACTA2~P62736 | ... | YES1~P07947 | YTHDF3~Q7Z739 | ZBTB16~Q05516 | ZBTB17~Q13105 | WHO max | WHO 0 | Age cat | COVID | BMI cat | HEART | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 380.000000 | 380.000000 | 380.000000 | 380.000000 | 378.000000 | 374.000000 | 378.000000 | 383.000000 | 383.000000 | 374.000000 | ... | 380.000000 | 378.000000 | 380.000000 | 374.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 |
mean | 3.532293 | 1.842658 | 2.760268 | 3.463129 | 2.477235 | 1.468916 | 0.137929 | 4.213379 | 3.877484 | 0.186726 | ... | 3.582903 | 0.771492 | 1.751017 | 1.423486 | 3.498695 | 3.832898 | 3.219321 | 0.796345 | 2.480418 | 0.185379 |
std | 1.400306 | 0.953734 | 0.991535 | 1.378123 | 0.566340 | 0.933315 | 0.371251 | 0.664486 | 0.799831 | 0.902741 | ... | 1.538310 | 0.839493 | 0.975425 | 0.676457 | 1.510651 | 1.254631 | 1.208150 | 0.403242 | 1.152738 | 0.389113 |
min | 0.395300 | -0.416000 | 0.722100 | 0.541900 | 0.943900 | -0.304900 | -1.231400 | 2.751800 | -0.470600 | -1.322600 | ... | 0.883800 | -1.478300 | 0.176700 | -0.153700 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 2.523550 | 1.208275 | 2.030875 | 2.616600 | 2.053875 | 0.855425 | -0.075375 | 3.718600 | 3.374700 | -0.405800 | ... | 2.253875 | 0.206750 | 1.057350 | 0.978850 | 2.000000 | 2.000000 | 2.000000 | 1.000000 | 2.000000 | 0.000000 |
50% | 3.227700 | 1.663900 | 2.676800 | 3.310600 | 2.500100 | 1.253250 | 0.145550 | 4.222300 | 3.888400 | -0.038550 | ... | 3.395650 | 0.719950 | 1.520150 | 1.330750 | 4.000000 | 4.000000 | 3.000000 | 1.000000 | 2.000000 | 0.000000 |
75% | 4.082625 | 2.237375 | 3.296025 | 4.087600 | 2.840300 | 1.855025 | 0.374600 | 4.653950 | 4.314900 | 0.432250 | ... | 4.673750 | 1.309325 | 2.259025 | 1.771600 | 4.000000 | 5.000000 | 4.000000 | 1.000000 | 3.000000 | 0.000000 |
max | 8.334200 | 5.339900 | 6.507100 | 9.427600 | 4.161700 | 6.196200 | 1.760600 | 6.656400 | 7.554500 | 4.327900 | ... | 8.330600 | 4.298200 | 5.154800 | 4.485200 | 6.000000 | 6.000000 | 5.000000 | 1.000000 | 5.000000 | 1.000000 |
8 rows × 1426 columns
[ ]:
[37]:
result = analytics.run_correlation(df_wide, alpha=0.05, subject='subject_id', group="COVID", method="spearman", correction='fdr_bh')
[38]:
net = viz.get_network(result, identifier='corr_net', args={'source':'node1', 'target':'node2',
'weight':'weight', 'values':'weight',
'title':'Correlation Network',
'color_weight': True, 'node_size':'degree',
'cutoff': 0.5, 'cutoff_abs':True})
[39]:
viz.visualize_notebook_network(net['notebook'])
[40]:
df_wide.to_csv('olink_data_NPX_values_WHO_max.tsv', sep='\t', index=False, header=True, doublequote=False)
Imputation using mixed model: KNN when 60% valid values or Probabilistic Minimum Imputation otherwise
Imputation per group based on WHO 0
[41]:
df_wide = analytics.imputation_mixed_norm_KNN(df_wide, index_cols=['WHO 0', 'WHO max', 'SampleID', 'subject_id', 'Age cat', 'COVID', 'BMI cat', 'HEART'], shift=1.8, nstd=0.3, group='COVID', cutoff=0.6)
[42]:
df_wide.head()
[42]:
AARSD1~Q9BTE6 | ABHD14B~Q96IU4 | ABL1~P00519 | ACAA1~P09110 | ACAN~P16112 | ACE2~Q9BYF1 | ACOX1~Q15067 | ACP5~P13686 | ACP6~Q9NPH0 | ACTA2~P62736 | ... | WISP2~O76076 | WNT9A~O14904 | WWP2~O00308 | XCL1~P47992 | XG~P55808 | XPNPEP2~O43895 | YES1~P07947 | YTHDF3~Q7Z739 | ZBTB16~Q05516 | ZBTB17~Q13105 | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
WHO 0 | WHO max | SampleID | subject_id | Age cat | COVID | BMI cat | HEART | |||||||||||||||||||||
4 | 4 | 100_D0 | 100 | 1 | 1 | 4 | 0 | 3.2374 | 1.9080 | 3.4624 | 3.7201 | 2.8415 | 1.4208 | 0.3747 | 4.6433 | 4.0392 | -0.4904 | ... | 4.4080 | 0.1052 | 5.0765 | 2.5597 | 1.7649 | 5.1321 | 5.0236 | 0.4053 | 2.9979 | 1.2529 |
101_D0 | 101 | 2 | 1 | 1 | 0 | 2.1038 | 1.1206 | 2.0700 | 2.0108 | 2.6067 | 0.4767 | -0.0337 | 4.6615 | 3.4319 | -0.3363 | ... | 3.8144 | 0.4324 | 3.8148 | 2.0496 | 1.4435 | 4.4233 | 3.3785 | 0.2607 | 1.9315 | 1.1605 | ||
1 | 1 | 102_D0 | 102 | 5 | 1 | 3 | 0 | 2.7613 | 1.3493 | 2.8657 | 2.8731 | 2.0461 | 0.3986 | 0.2983 | 4.4803 | 3.1530 | -0.3694 | ... | 4.6518 | 0.4650 | 4.7405 | 1.3652 | 1.6223 | 4.4243 | 4.6267 | 0.4771 | 1.7620 | 0.6840 |
6 | 6 | 103_D0 | 103 | 1 | 1 | 5 | 0 | 2.6384 | 0.9447 | 1.4727 | 2.6550 | 1.6613 | 0.6749 | 0.1486 | 3.7081 | 3.6198 | -0.9642 | ... | 4.2031 | 0.2012 | 3.9818 | 1.7696 | 1.7539 | 2.2632 | 2.9520 | 0.4808 | 0.9480 | 0.4823 |
104_D0 | 104 | 5 | 0 | 3 | 1 | 5.3336 | 1.5130 | 2.0466 | 1.8627 | 2.4433 | 1.5736 | -0.1446 | 3.2100 | 3.8780 | 1.3395 | ... | 5.3859 | 1.1573 | 4.0152 | 2.4125 | 2.0343 | 3.7573 | 2.3919 | 0.0496 | 1.3069 | 1.1057 |
5 rows × 1420 columns
[43]:
df_wide.to_csv('olink_data_NPX_values_imputed_KNN.tsv', sep='\t', index=False, header=True, doublequote=False)
[44]:
df_wide = df_wide.reset_index()
[45]:
df_wide['COVID'] = ['COVID-19 positive' if c else 'COVID-19 negative' for c in df_wide['COVID']]
[46]:
pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'Age cat', 'WHO max', 'BMI cat', 'HEART'], group='COVID', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 250
args['loadings'] = 15
args['title'] = 'Olink data'
[47]:
figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)
[48]:
pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'Age cat', 'COVID', 'BMI cat', 'HEART', 'WHO max'], group='WHO 0', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 250
args['loadings'] = 15
args['title'] = 'Olink data coloured by WHO score at timepoint D0'
[49]:
figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)
[50]:
pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'Age cat', 'COVID', 'BMI cat', 'HEART', 'WHO 0'], group='WHO max', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 250
args['loadings'] = 15
args['title'] = 'Olink data coloured by maximum WHO score'
[51]:
figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)
[52]:
pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'WHO max', 'COVID', 'BMI cat', 'HEART'], group='Age cat', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 250
args['loadings'] = 15
args['title'] = 'Olink data coloured by age category'
[53]:
figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)
[54]:
pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'WHO max', 'COVID', 'BMI cat'], group='HEART', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 350
args['loadings'] = 20
args['title'] = 'Olink data coloured by HEART'
[55]:
figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)
[56]:
pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'WHO max', 'COVID', 'HEART'], group='BMI cat', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 350
args['loadings'] = 20
args['title'] = 'Olink data coloured by BMI cat'
[57]:
figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)
Differential regulation: Covid-19 positive vs negative¶
[58]:
ancova_results = analytics.run_ancova(df_wide, covariates=['Age cat', 'HEART', 'BMI cat'], drop_cols=['SampleID', 'subject_id', 'WHO max'], subject='subject_id', group='COVID', is_logged=True)
[59]:
volcano_plot = viz.run_volcano(ancova_results, identifier='proteomics_volcanos',
args={'alpha':0.01, 'fc':2, 'num_annotations': 50,
'colorscale':'Blues', 'showscale': False,
'marker_size':8, 'x_title':'log2FC', 'y_title':'-log10(pvalue)'})
for plot in volcano_plot:
iplot(plot.figure)
Functional enrichment¶
[60]:
go_terms_query = "MATCH (p:Protein)-[]-(bp:Biological_process) WHERE (p.name+'~'+p.id) IN {} RETURN DISTINCT (p.name+'~'+p.id) AS identifier,bp.name AS annotation"
go_terms_query = go_terms_query.format(df_wide.columns.tolist())
annotation = connector.run_query(go_terms_query)
[61]:
annotation.head()
[61]:
annotation | identifier | |
---|---|---|
0 | mitochondrial genome maintenance | AKT3~Q9Y243 |
1 | mitochondrial genome maintenance | TYMP~P19971 |
2 | regulation of DNA recombination | IL7R~P16871 |
3 | cell wall mannoprotein biosynthetic process | MPI~P34949 |
4 | very long-chain fatty acid metabolic process | ACAA1~P09110 |
[62]:
annotation.shape
[62]:
(19230, 2)
[63]:
enrichment_results = analytics.run_up_down_regulation_enrichment(ancova_results, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)
[64]:
figures = viz.get_enrichment_plots(enrichment_results, identifier='enrichment', args={'width':1800})
for fig in figures:
iplot(fig.figure)
[ ]:
[65]:
df_wide = df_wide[df_wide['COVID'] == 'COVID-19 positive']
[66]:
df_wide['WHO 0'] = df_wide['WHO 0'].astype('str')
df_wide['WHO max'] = df_wide['WHO max'].astype('str')
[67]:
df_wide.shape
[67]:
(305, 1428)
[68]:
ancova_results = analytics.run_ancova(df_wide, covariates=['Age cat', 'HEART', 'BMI cat'], drop_cols=['SampleID', 'subject_id', 'COVID', 'WHO 0'], subject='subject_id', group='WHO max', is_logged=True)
[69]:
volcano_plot = viz.run_volcano(ancova_results, identifier='proteomics_volcanos',
args={'alpha':0.01, 'fc':2, 'num_annotations': 50,
'colorscale':'Blues', 'showscale': False,
'marker_size':8, 'x_title':'log2FC', 'y_title':'-log10(pvalue)'})
for plot in volcano_plot:
iplot(plot.figure)
[70]:
go_terms_query = "MATCH (p:Protein)-[]-(bp:Biological_process) WHERE (p.name+'~'+p.id) IN {} RETURN DISTINCT (p.name+'~'+p.id) AS identifier,bp.name AS annotation"
go_terms_query = go_terms_query.format(df_wide.columns.tolist())
annotation = connector.run_query(go_terms_query)
[71]:
annotation.head()
[71]:
annotation | identifier | |
---|---|---|
0 | mitochondrial genome maintenance | AKT3~Q9Y243 |
1 | mitochondrial genome maintenance | TYMP~P19971 |
2 | regulation of DNA recombination | IL7R~P16871 |
3 | cell wall mannoprotein biosynthetic process | MPI~P34949 |
4 | very long-chain fatty acid metabolic process | ACAA1~P09110 |
[72]:
annotation.shape
[72]:
(19230, 2)
[73]:
enrichment_results = analytics.run_up_down_regulation_enrichment(ancova_results, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)
c:\users\sande\.conda\envs\pip_rev\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:
Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
c:\users\sande\.conda\envs\pip_rev\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:
Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
c:\users\sande\.conda\envs\pip_rev\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:
Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
c:\users\sande\.conda\envs\pip_rev\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:
Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
c:\users\sande\.conda\envs\pip_rev\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:
Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
c:\users\sande\.conda\envs\pip_rev\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:
Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
[74]:
figures = viz.get_enrichment_plots(enrichment_results, identifier='enrichment', args={'width':1800})
for fig in figures:
iplot(fig.figure)
[ ]:
[ ]: